NBA PCA analysis

Looking at similarities between NBA players from the 2015-2016 season

Roupen Khanjian
01-25-2021
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data, CRAN v2.1.0
library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1
library(scales) # Scale Functions for Visualization, CRAN v1.1.1
library(ggfortify) # Data Visualization Tools for Statistical Analysis Results, CRAN v0.4.11

library(gghighlight) # Highlight Lines and Points in 'ggplot2', CRAN v0.3.1

library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3

Brief Introduction to Data

The data used for this task was obtained from the following link: data. I decided to analyze National Basketball Association (NBA) player statistics from the 2015-2016 season. Each observation in this dataset is one player’s per-game statistics. I chose PCA to see how the players differed across 11 features that are deemed important to a basketball player’s success.

Data Wrangling

nba_players <- read_csv(here("_texts", 
                             "NBA_PCA",
                             "data", "nba_players.csv")) %>% 
  clean_names() %>% 
  separate(player, into = c("player", "html"), sep = "\\\\") %>% # clean the player name column
  dplyr::filter(mp > 18) %>% # filter for players who played over 18 minutes a game (out of a possible 48)
  dplyr::filter(g > 30) %>% # filter for players who played over 30 games (out of a possible 82)
  drop_na(age, fga, e_fg_percent, ft_percent, trb:pts)  # drop observations with missing values 

# Quick look at the data
nba_players %>%
  dplyr::select(player, pos, age, fga, e_fg_percent, ft_percent, trb:pts) %>% 
  slice(1:5)
# A tibble: 5 x 13
  player   pos     age   fga e_fg_percent ft_percent   trb   ast   stl
  <chr>    <chr> <dbl> <dbl>        <dbl>      <dbl> <dbl> <dbl> <dbl>
1 Steven … C        22   5.3        0.613      0.582   6.7   0.8   0.5
2 Arron A… SG       30  11.3        0.5        0.84    3.7   2     0.4
3 LaMarcu… PF       30  14.1        0.513      0.858   8.5   1.5   0.5
4 Lavoy A… PF       26   4.7        0.516      0.63    5.4   1     0.3
5 Tony Al… SG       34   7.3        0.474      0.652   4.6   1.1   1.7
# … with 4 more variables: blk <dbl>, tov <dbl>, pf <dbl>, pts <dbl>
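
The `separate()` step in the wrangling chunk is the least obvious part, so here is a minimal sketch of what it does. The two player strings are hypothetical: I am assuming the raw `player` column holds a display name, a backslash, and a page identifier, which is what the `"html"` column name suggests.

```r
library(tidyr)

# Hypothetical raw values (not taken from the real file):
# display name, one backslash, then an identifier
raw <- data.frame(
  player = c("Player One\\playeon01", "Player Two\\playetw01")
)

# `sep` is a regular expression. The R string "\\\\" is the regex \\,
# which matches a single literal backslash.
cleaned <- separate(raw, player, into = c("player", "html"), sep = "\\\\")

cleaned$player
```

After the split, `player` holds only the display name and `html` holds whatever followed the backslash.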

PCA

nba_players_pca <-  nba_players %>%  
  dplyr::select(age, fga, e_fg_percent, ft_percent, trb:pts) %>% # select the features for pca
  scale() %>% # scale the features
  prcomp() # run pca
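
Before reading a biplot it helps to check how much of the total variance the first two components capture, since those are the only ones the plot shows. The sketch below is not part of the original analysis: it uses the built-in `USArrests` table as a stand-in for `nba_players`, and `var_explained` and `cum_var` are names I made up for illustration.

```r
# Stand-in data: USArrests replaces the NBA features here
pca <- prcomp(scale(USArrests))

# Each component's variance is its squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Running total, useful for deciding how many components to keep
cum_var <- cumsum(var_explained)

round(var_explained, 3)
round(cum_var, 3)
```

The same calculation should carry over to `nba_players_pca$sdev`. If the first two values of `cum_var` are small, the biplot hides a large share of the differences between players.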

Biplot

autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "pos" # organize colors based off position
         ) +
  labs(title = "Biplot for PCA",
       caption = "Biplot of NBA players' basic statistics from the 2015-2016 NBA season.\nColors are organized by position.") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13)
        )
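
The arrows in a biplot like this one are derived from the loadings stored in the `rotation` element of a `prcomp` object, so the numbers can be inspected directly when the labels overlap. This is a sketch on the built-in `USArrests` table as a stand-in for the NBA data; `ranked` is an illustrative name.

```r
# Stand-in data: USArrests replaces the NBA features here
pca <- prcomp(scale(USArrests))

# Loadings of each original variable on the first two components
loadings <- pca$rotation[, c("PC1", "PC2")]

# Order the variables by how strongly they load on PC1
ranked <- loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ]

round(ranked, 2)
```

Variables with similar rows point in similar directions in the biplot. A large absolute value means the variable contributes heavily to that component.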

A few observations from the above biplot:

Biplot Highlighting a Few Players

Below is the same biplot, but with the 5 best players of that season highlighted (according to the MVP voting, which can be found here: MVP voting).

autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player"
         ) +
  labs(title = "Biplot for PCA",
       subtitle = "Top 5 players in MVP Voting are Highlighted",
       caption = "Biplot highlighting some of the best players for the 2015-2016 NBA season") +
  gghighlight(player %in% c("Kawhi Leonard", "Stephen Curry", "LeBron James",
                            "Russell Westbrook", "Kevin Durant")) + # top 5 players in MVP voting
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13),
        plot.subtitle = element_text(size = 11)
        )

Biplot Using plotly to see Similarities Between Players

Lastly, to see which players are similar to one another, I made an interactive plot where you can hover over each data point to reveal the name of the player.

nba_pca_plot <- autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player", # map colour to player so each point carries the player's name
         colour.show.legend = FALSE
         ) +
  labs(title = "Interactive Biplot") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        legend.position="none",
        plot.title = element_text(face = "bold", size = 13)
        )

ggplotly(nba_pca_plot, tooltip = "player") # interactive plot
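
Hovering works well for browsing, but similarity can also be put into numbers by measuring distances between observations in the space of the first two components. The sketch below is an addition of mine, not something the analysis above does. It uses the built-in `USArrests` table as a stand-in, and the `target` row is an arbitrary choice.

```r
# Stand-in data: USArrests replaces the NBA features here
pca <- prcomp(scale(USArrests))

# Coordinates of every observation on PC1 and PC2 (what the biplot displays)
scores <- pca$x[, 1:2]

# Pairwise Euclidean distances between observations in that 2-D space
d <- as.matrix(dist(scores))

# Three closest observations to an arbitrary target row;
# the first sorted entry is the target itself (distance 0), so skip it
target <- "California"
nearest <- sort(d[target, ])[2:4]

names(nearest)
```

With the NBA data, the row names would have to be set to the player names first, for example with `rownames(scores) <- nba_players$player`. Keep in mind that distances in two components ignore whatever variance the remaining components hold.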